home *** CD-ROM | disk | FTP | other *** search
Text File | 1996-07-06 | 4.8 KB | 113 lines | [TEXT/R*ch] |
- About DMultiStringLocator
- =========================
- Written July 6, 1996 by Eric Gundrum <eric@teamworks.com>
-
- Preface
- -------
- This document describes the DMultiStringLocator class. It contains these
- sections:
-
- - What is DMultiStringLocator
- - Implementation Notes
- - About the Sample Code
- - Use Restrictions
-
-
- What is DMultiStringLocator
- ---------------------------
- DMultiStringLocator is a C++ class to provide fast searching for one or
- more strings. It is an implementation of a table-based algorithm
- credited to Aho and Corasick and described in "Practical Algorithms for
- Programmers" by Andrew Binstock and John Rex (published in 1995 by
- Addison Wesley).
-
- DMultiStringLocator is designed to search for more than one string at a
- time. Searching for multiple strings works nearly as fast as searching
- for a single string. Using a table-lookup mechanism eliminates any
- unnecessary backtracking when a partial match fails, accounting for much
- of the speed.
-
- This algorithm may not be the fastest available, but it should be
- sufficient for most practical problems. It is general enough to be used
- when searching for a single string, as well as for more than one in the
- same target text.
-
-
- Implementation Notes
- --------------------
- The current implementation expects a list of DSearchString objects when
- the DMultiStringLocator object is instantiated. Internal state tables
- are built from this list in the constructor. (This allocates memory;
- about eight times the combined length of the search strings and at least
- two times the alphabet size.) Changing the list requires constructing a
- new search object (and its state tables).
-
- A DSearchString object, based on PowerPlant's LStr255 class, contains a
- ReportFound() virtual member function that must be overridden in a
- derived class. ReportFound() is executed for each found string. (This is
- a performance bottleneck; keep ReportFound() short and fast.) A
- parameter to ReportFound() is the position in the string of the last
- character of the found string. Use this value to locate the found
- strings after the search phase.
-
- A search is initiated with a call to SearchBuffer(), which will search a
- RAM-based buffer (by walking a pointer through the buffer). Searching
- more data than will fit in one buffer can be accomplished with repeated
- calls to SearchBuffer(), but you must deal with a match string that is
- split over two buffers.
-
- To help handle strings split across the buffer, SearchBuffer() returns
- the maximum number of characters which might belong to a string found at
- the end of the current buffer. These characters should be included in
- the next buffer. However, if these characters also include other
- complete strings, those strings will be found twice.
-
- For example, consider searching for 'foobarooba' and 'bar', where the
- first buffer ends with 'foobaroo'. SearchBuffer() will return 8,
- indicating that the second buffer should begin 8 characters before the
- end of the first. However, such as search will find 'bar' twice. This is
- a design flaw I have not yet resolved. (It is not significant to my
- use.) I suggest working around it by looking for multiple hits on the
- same string in a post process phase (by tracking the hit position). Or,
- handle it in ReportFound() if performance is not an issue.
-
- Another limitation of this implementation is that it searches for a byte
- sequence. It does not support case-insensitive searches. However, that
- is easily remedied by changing the case of the search strings and input
- text.
-
- DMultiStringLocator currently relies on a few PowerPlant classes,
- including LList, LListIterator, and LString. With a bit of effort, these
- classes could be replaced with STL classes. I'll leave that for a future
- project.
-
-
- About the Sample Code
- ---------------------
- The sample code consists of the CSearchWindow class to demonstrate the
- search engine and a supporting PowerPlant application (PPShell).
-
- CSearchWindow builds the search list from user input (in DoSearch()). It
- then reads the target file into a buffer and performs a search on that
- buffer (in SearchFile()). If the file contains more data, the process is
- repeated until all is read. Note that SearchFile() will overlay the
- buffer ends to a deal with match strings split over two buffers.
-
- CSearchString is an important support class used by CSearchWindow.
- Derived from DSearchString, CSearchString demonstrates how to minimally
- collect the results of a search with the ReportFound() function. Here
- ReportFound() is used to count the number of hits. It also tracks the
- position of found strings to prevent reporting them twice.
-
-
- Use Restrictions
- ----------------
- I place no restrictions on how this code may be used or distributed. I
- ask only that my copyright statement be retained in the source files of
- all derivative works. (I put a lot of work into this class and I want
- people to know who did it.)
-
- If you have comments or want to report bugs, send mail to
- <eric@teamworks.com>.
-
-